Synopsis

This project explores the movies data set from Kaggle and looks to find what factors affect gross revenue the most. The data is first cleaned by removing missing values/duplicates and reformatting some columns for consistency and readability. The analysis of the data looks at correlation coefficients to find out which factors have a positive correlation with gross revenue and the p-value of each correlation coefficient is used to check if the results are statistically significant. The two factors having the highest correlation with gross revenue are then visualized.

Data Overview

Columns
  • budget: the budget of a movie. Some movies don’t have this, so it appears as 0

  • company: the production company

  • country: country of origin

  • director: the director

  • genre: main genre of the movie.

  • gross: revenue of the movie (in USA only)

  • name: name of the movie

  • rating: rating of the movie (R, PG, etc.)

  • released: release date (YYYY-MM-DD)

  • runtime: duration of the movie (minutes)

  • score: IMDb user rating

  • votes: number of user votes

  • star: main actor/actress

  • writer: writer of the movie

  • year: year of release

 

Data Cleaning

Checking for missing values

missing_values <- sapply(movies, function(x) sum(is.na(x)))
Missing Values
name 0
rating 0
genre 0
year 0
released 0
score 3
votes 3
director 0
writer 0
star 0
country 0
budget 2171
gross 189
company 0
runtime 4

A majority of the missing values occur in the ‘budget’ column of the data and the rest is spread across multiple columns. I think that it would be unreasonable and very time consuming to fill all the missing values with information found on the internet. So instead, for this project we will remove rows with missing values from the data. But before we do this, we should check how much of the data will be lost after removing the rows with missing values.

 

percent_missing_values <- colMeans(is.na(movies))*100
% Missing Values
name 0.0000000
rating 0.0000000
genre 0.0000000
year 0.0000000
released 0.0000000
score 0.0391236
votes 0.0391236
director 0.0000000
writer 0.0000000
star 0.0000000
country 0.0000000
budget 28.3124674
gross 2.4647887
company 0.0000000
runtime 0.0521648

After the removal of the rows containing missing values, we will be left with around 70% of the original data. I think this leaves us with an adequate amount of data to analyse and draw some general conclusions from. So we will move along with removing the rows with missing values from the data.

 

Checking for duplicate data

unique_movies <- distinct(movies)

nrow(unique_movies) == nrow(movies)
## [1] TRUE

Next we will check for duplicate data by checking if all the rows of the data are unique by comparing the number of rows that are unique with the number of rows of the original data set. The TRUE output indicates that the number of unique rows is the same as the number of rows in the movies data set, so there are no duplicate values.

 

Reformatting columns

Structure of the data
## 'data.frame':    5435 obs. of  15 variables:
##  $ name    : chr  "The Shining" "The Blue Lagoon" "Star Wars: Episode V - The Empire Strikes Back" "Airplane!" ...
##  $ rating  : chr  "R" "R" "PG" "PG" ...
##  $ genre   : chr  "Drama" "Adventure" "Action" "Comedy" ...
##  $ year    : int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ released: chr  "June 13, 1980 (United States)" "July 2, 1980 (United States)" "June 20, 1980 (United States)" "July 2, 1980 (United States)" ...
##  $ score   : num  8.4 5.8 8.7 7.7 7.3 6.4 7.9 8.2 6.8 7 ...
##  $ votes   : num  927000 65000 1200000 221000 108000 123000 188000 330000 101000 10000 ...
##  $ director: chr  "Stanley Kubrick" "Randal Kleiser" "Irvin Kershner" "Jim Abrahams" ...
##  $ writer  : chr  "Stephen King" "Henry De Vere Stacpoole" "Leigh Brackett" "Jim Abrahams" ...
##  $ star    : chr  "Jack Nicholson" "Brooke Shields" "Mark Hamill" "Robert Hays" ...
##  $ country : chr  "United Kingdom" "United States" "United States" "United States" ...
##  $ budget  : num  1.9e+07 4.5e+06 1.8e+07 3.5e+06 6.0e+06 5.5e+05 2.7e+07 1.8e+07 5.4e+07 1.0e+07 ...
##  $ gross   : num  4.70e+07 5.89e+07 5.38e+08 8.35e+07 3.98e+07 ...
##  $ company : chr  "Warner Bros." "Columbia Pictures" "Lucasfilm" "Paramount Pictures" ...
##  $ runtime : num  146 104 124 88 98 95 133 129 127 100 ...

Here is an overview of the data types of each column currently. We first notice that the release date of the movie is a character type in the ‘released’ column, which does not seem correct because we would expect a date data type. This could be due to the column having both date and country of origin in one line. In order to reformat the ‘released’ column to a date data type, we must first isolate the date.

 

movies$released <- gsub("\\s*\\([^\\)]+\\)","",movies$released) 

We will be doing this with the gsub function and using a regular expression to help remove all elements that are contained within brackets.

 

Next we notice that some dates in the ‘released’ column only show year.

 

movies[str_length(movies$released)==4,]$released <- paste0("January 1, ", movies[str_length(movies$released)==4,]$released) 

So we will add a default month and day to the date as we are mainly concerned about the year and want to avoid the mdy() function from converting these values into NULL values instead of a date.

 

movies$released <- mdy(movies$released)

We can now convert the ‘released’ column to the date data type.

 

The next issue we notice is that some years in the ‘released’ column are inconsistent with the ‘year’ column.

 

movies <- mutate(movies, year = year(movies$released))

To fix this, we will use the year from the ‘released’ column to change the ‘year’ column of each movie accordingly for consistency.

 

Preliminary Data Exploration

Before we conduct the data analysis, we can look at the data to generate an initial hypothesis. Ordering the data by gross revenue descending allows us to see that movies that have high gross revenue tend to also have higher budgets. This would make sense because it would be expected for high budget movies to also generate more gross revenue. The other variables appear to have no noticeable relationship with gross revenue upon early inspection of the data. But we will look at the correlation coefficients to check our early hypothesis and find other potential factors that may influence gross revenue.

 

Data Analysis

Looking at correlation coefficients

  year score votes budget gross runtime
year            
score 0.061
(<.001)
         
votes 0.203
(<.001)
0.474
(<.001)
       
budget 0.320
(<.001)
0.072
(<.001)
0.440
(<.001)
     
gross 0.269
(<.001)
0.222
(<.001)
0.615
(<.001)
0.740
(<.001)
   
runtime 0.074
(<.001)
0.415
(<.001)
0.352
(<.001)
0.319
(<.001)
0.276
(<.001)
 
Computed correlation used pearson-method with listwise-deletion.

 

As we can see from both the correlation matrix and table, the gross revenue is most affected by votes and budget of a movie. This could be because the more votes a movie has the more people have likely went to see it and in general higher budget movies would be expected to generate more revenue than those with lower budgets. Also surprisingly, the score seems to have the the least influence on the gross revenue of a movie, although there is still somewhat of a positive correlation between the two variables.

Observing the p-values under each correlation coefficient, we can see that the p-value is less than 0.05 which suggests that the correlation between these variables is statistically significant and that both votes and budget indeed could have a positive relationship with gross revenue.

 

Visualising our findings

 

 

From the visualisations we can see that in general gross revenue has positive correlation with budget and votes. This may suggest that both the budget of a film and votes for a films score may positively influence gross revenue as they increase. Although the correlation between votes and gross revenue appears to be weaker of the two variables.